Note that throughout this notebook we're going to be using various packages that you may not have installed. If you encounter an error using one, check to make sure you've installed it. If you haven't, it should be easy to do so with conda.
In this part of the assignment you'll use text from reviews of products on amazon.com to predict what overall score the reviewer gave the product. The dataset is from here. There were lots of different datasets to choose from, but we felt like going with beauty products this time to do something a little different.
There are three datasets within the Part1_Amazon_Reviews folder, one for each of training, validation, and test. There are 198502 reviews total across the dataset, and we've kind of arbitrarily separated them into 60% for training, 30% for validation, and 10% for testing... ish. The exact percentages don't matter too much here.
Take a look at what's inside the data. You can open it in a browser or a text editor. Both should show you tons of text. If you haven't seen .json data before, this is what it looks like. Each review object starts with a { and ends with a }, and contains a bunch of features separated by commas. Each feature has a name and a value.
1.1 (2 points): In 13 words or less per feature, explain what you think each feature in the dataset means in the box below.
print (len(("reviewerID means the ID of the user that reviewed the product ").split()) < 14)
print (len(("asin means the Amazon Standard Identification Number for the product").split()) < 14)
print (len(("reviewerName means the name of the user that reviewed the product").split()) < 14)
print (len(("helpful means the counts of helpful votes versus total votes on the review").split()) < 14)
print (len(("reviewText means the content of the review that the user wrote").split()) < 14)
print (len(("overall means the final score that the product was given").split()) < 14)
print (len(("summary means the text the user writes to summarize their review").split()) < 14)
print (len(("unixReviewTime means the time that the review was written in Unix Time").split()) < 14)
print (len(("reviewTime means the date the review was posted.").split()) < 14)
The first thing we have to do is import the data. Normally json data is really easy to import, but this file is in a slightly annoying format. Notably, the reviews don't have commas separating them. Fortunately this is pretty easy to overcome with a bit of iterating. For this assignment, we only care about the review text and the overall score within each review. To save you a bit of googling, you can access individual items within a json object by doing object["featurename"].
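To save even more googling, here's what that access pattern looks like on a single made-up review line (the field values below are hypothetical, just to show the shape):

```python
import json

# One line of the file is one complete JSON object (hypothetical values)
line = '{"reviewText": "Great product, smells wonderful!", "overall": 5.0}'
review = json.loads(line)
print(review["reviewText"])  # access features with object["featurename"]
print(review["overall"])
```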
1.2 (3 points): Import data into X and y arrays for train, validation and test sets.
We'll give you the imports:
import json
import numpy as np
# Put your resulting data in numpy arrays named:
# "Train_X", "Train_y", "Validation_X", "Validation_y", "Test_X", and "Test_y"
# with X as review text and y as overall score
# Your code goes here
class Review(object):
    # The class "constructor" - it's actually an initializer
    def __init__(self, text, score):
        self.text = text
        self.score = score

def load_reviews(path, n):
    # Each line of the file is one complete JSON review object
    X = np.empty((n,), dtype=object)
    y = np.empty((n,))
    with open(path, 'r') as f:
        for i, line in enumerate(f):
            row = json.loads(line)
            X[i] = row['reviewText']
            y[i] = row['overall']
    return X, y

Train_X, Train_y = load_reviews('beauty_train.json', 125398)
Validation_X, Validation_y = load_reviews('beauty_validation.json', 63104)
Test_X, Test_y = load_reviews('beauty_test.json', 10000)
# If you've done this right, you should get (125398,) (125398,) (63104,) (63104,) (10000,) (10000,)
print (Train_X.shape, Train_y.shape, Validation_X.shape, Validation_y.shape, Test_X.shape, Test_y.shape)
We're going to test how good a bunch of models are at predicting scores on this dataset. It would be cheating to keep training on the training set and testing on the test set, as you could just end up finding the model that best fits that specific data rather than one that generalizes to data you may not yet have seen.
Thus, we'll be using cross-validation. You've already seen a bit of this in A2, but here is an explanation.
Before we get to that though, we're going to build a pipeline to prepare the text data to be classified and then build a classifier all in one step. Here is a rather lengthy explanation of what pipelines are. People have tons of different syntactical styles in writing pipelines, but we'll be using the syntax at 14:30 in the video above. It'll look something like this:
# Don't run this, it's just an example. The semicolon at the end isn't part of the pipeline.
"""
model = Pipeline([
    ('name of transform1', Transform1()),
    ('name of transform2', Transform2()),
    ('name of classifier', Classifier()),
])
model.fit(X, y)
"""
;
So what's happening here? "model" is a Pipeline object. It's defined in the "model = ..." block, and used with model.fit(X, y). It works by taking whatever you put into it, in this case X and y, and running them through each step in the Pipeline. Typically this means each step until the last will transform the data in some way, and the last step will fit a classifier to the transformed data. Once a Pipeline is fit, you can use it by calling the methods of the final estimator in the pipeline. For example, in order to call model.predict(X), the last step in the Pipeline needs to be a classifier, since transformers don't have a .predict() method.
Note that I'm using Pipeline rather than pipeline above, because these things all apply specifically to the Pipeline class in sklearn, but not necessarily to the broader concept of an ML pipeline.
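To see the equivalence concretely, here's a sketch on a tiny made-up corpus; the step names ('counts', 'tfidf', 'clf'), documents, and labels are ours, purely illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, just to show the mechanics
docs = ["great product", "terrible product", "great value"]
labels = [5, 1, 5]

# The Pipeline version...
pipe = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe.fit(docs, labels)

# ...does the same thing as chaining the steps by hand:
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
clf = MultinomialNB().fit(tfidf, labels)
print(pipe.predict(["great stuff"]))
```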
What are the advantages of using a Pipeline? Well:
Convinced yet? If not, that's cool. You're gonna do it anyways. In this pipeline you're going to have three steps. A CountVectorizer transformer, a TfidfTransformer, and a classifier. You'll play with a few different classifiers to see which one works best.
1.3 (2 points) Look up the documentation for CountVectorizer and TfidfTransformer. For each, explain what it's going to do to your data in fewer than 30 words.
print (len(("CountVectorizer takes a lot of text documents, analyzes the contents and builds a dictionary of vocabulary using the tokenized contents.").split()) < 31)
print (len(("TFIDF measures term frequency and inverse document frequency. It measures how often a term shows up in a document, and uses this to score how important a term is.").split()) < 31)
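If you want to see what these two transformers actually emit, here's a sketch on a two-document made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (made up) to peek at each transformer's output
docs = ["the cat sat on the mat", "the dog sat"]
counts_model = CountVectorizer()
counts = counts_model.fit_transform(docs)
print(sorted(counts_model.vocabulary_))  # the learned vocabulary
print(counts.toarray())                  # raw per-document word counts
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))          # counts reweighted: rarer words score higher
```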
Next, make a Pipeline. The first two steps will be the above two transformers. The third step will be a classifier. We'll be testing out five classifiers here: KNeighborsClassifier, MultinomialNB, LogisticRegression, RandomForestClassifier, and an additional one of your choice. Just put one in for now; we'll swap them in and out later.
Update 10/10/2018 2:49PM - It seems like KNeighbors is taking a long time to run on some computers. If it takes longer than 15 minutes, you can go ahead and substitute it for a different classifier of your choice in sklearn (still 5 total).
1.4 (6 points) Build a Pipeline to do the above using the format shown in the model = Pipeline([ example above.
We'll let you figure out the imports this time.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
model = Pipeline([
('CountVectorizer', CountVectorizer()),
('TfidfTransformer', TfidfTransformer()),
('AdaBoostClassifier', AdaBoostClassifier())
])
model.fit(Train_X, Train_y)
Next we're going to do some cross validation to test out how well each of the classifiers does on the data.
1.5a (6 points) In the cell below, run five-fold cross validation using each of the five classifiers from above using the cross_val_score function in sklearn.model_selection. For each classifier, output both the accuracy score and the weighted f1 score. Write down the averages of each of these over the five folds in the slots below.
You can read about f1 scores in the cross_val_score documentation. Basically f1 is a complement to accuracy that incorporates both precision and recall.
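Concretely, on a made-up set of binary labels, f1 works out to the harmonic mean of precision and recall:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up binary labels, just to show the relationship
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
p = precision_score(y_true, y_pred)    # 2 of the 3 predicted positives are right
r = recall_score(y_true, y_pred)       # 2 of the 3 actual positives were found
print(p, r, f1_score(y_true, y_pred))  # f1 = 2*p*r / (p + r)
```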
NOTE: Make sure you're doing cross validation on your Validation datasets! To use the Train or Test sets defeats the whole purpose.
# You'll need to import four things
from sklearn.model_selection import cross_val_score
# Your code goes here
scores = cross_val_score(model, Validation_X, Validation_y, cv = 5)
scores2 = cross_val_score(model, Validation_X, Validation_y, cv = 5, scoring = 'f1_weighted')
# Write in your average values below. Three digits is fine:
#print("KNeighborsClassifier_AvgAccuracy = 0.000")
#print("KNeighborsClassifier_AvgF1 = 0.000")
#KNeighbors crashes my computer
print("MultinomialNB_AvgAccuracy = 0.5785")
print("MultinomialNB_AvgF1 = 0.4248")
print("LogisticRegression_AvgAccuracy = 0.6564")
print("LogisticRegression_AvgF1 = 0.609682")
print("RandomForestClassifier_AvgAccuracy = 0.5852")
print("RandomForestClassifier_AvgF1 = 0.4963")
print("AdaBoostClassifier_AvgAccuracy = 0.6159")
print("AdaBoostClassifier_AvgF1 = 0.551796")
print("YourPick_AvgAccuracy(Decision Tree) = 0.5075")
print("YourPick_AvgF1(Decision Tree) = 0.5010")
PART 1.5 OPTIONAL BONUS (3 points or 5 points): 3 points: Build a Pipeline that ends up getting you greater than 0.75 accuracy and greater than 0.65 weighted f1 score. You can use different or additional transformers and whatever classifier you want to use from anywhere, not necessarily just sklearn. Additional 2 points: get greater than 0.80 accuracy and greater than 0.7 weighted f1 score. This is going to be HARD, but there are definitely ways to do this classification far better than we just did.
Finally it's time to see how the model really performs on test data.
1.6 (6 points) Pick the classifier that performed best above, then fit a model on your training data and predict on your test data. Print out your accuracy and f1 scores, and then print out the confusion matrix.
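If you haven't used confusion_matrix before, a toy run on made-up labels shows the layout (rows are true classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Made-up 3-class labels, purely to show the layout
y_true = [1, 2, 2, 3]
y_pred = [1, 2, 3, 3]
print(confusion_matrix(y_true, y_pred))
# The off-diagonal 1 records the true-2 sample that was predicted as class 3
```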
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
modelBest = Pipeline([
    ('CountVectorizer', CountVectorizer()),
    ('TfidfTransformer', TfidfTransformer()),
    ('LogisticRegression', LogisticRegression()),
])
# Fit on the training data, then predict on the held-out test data
modelBest.fit(Train_X, Train_y)
Test_y_pred = modelBest.predict(Test_X)
print("Best Model's Accuracy: " + str(accuracy_score(Test_y, Test_y_pred)))
print("Best Model's Weighted F1: " + str(f1_score(Test_y, Test_y_pred, average='weighted')))
confusion_matrix(Test_y, Test_y_pred)
Congrats! You've made a halfway-decent text classifier that does something vaguely useful!
The next thing we're going to do is try to see if we can create a classifier to tell the difference between images of cats and images of non-cats. We'll be using the pleasantly-named "Pillow" fork of the Python Imaging Library (PIL).
Starring in the cat images portion of this dataset are Taro, Teru, Rhubarb Penelope (Ruby), Mallow, and Hubble, all of whom are members of the 4th year HCII PhD cohort:
# Just run this. Also you can probably get some clues from this for your next steps.
# We figured it was worth giving away answers to show pictures of these cats.
import PIL
import matplotlib.pyplot as plt
from PIL import Image
fig = plt.figure(figsize=(32, 32))
for i, name in enumerate(["Taro", "Teru", "Ruby", "Mallow", "Hubble"]):
    img = Image.open('Part2_Cat_Classifying/' + name + '.jpg')
    sp = plt.subplot(5, 1, i + 1)
    sp.set_title(name)
    plt.imshow(img)
The non-cat images are a subset of the OASIS image dataset by Benedek Kurdi, explained here. We've removed images from this dataset that show people or animals, just to try to make classification easier.
Because we're only using a subset of the OASIS dataset, please don't use these images for other projects. The full dataset is available for free download at the above link if you want to use it for another purpose.
Example scenery image:
# Just run this
img = Image.open('Part2_Cat_Classifying/Lake5.jpg')
plt.imshow(img)
Unfortunately, the formal sklearn Pipeline function isn't actually terribly useful for this task. However, as it turns out, "pipeline" is a broader term that really just means the steps you do to import your data, clean it, visualize it, select features, model, and analyze results. Sklearn's Pipeline is a very common way to do this for some types of data, but you can do it manually for others (or you can write out functions to create your own version of Pipeline for a particular application).
Keep the structure of Pipelines in the back of your mind, though. We'll come back to it.
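A minimal sketch of such a hand-rolled pipeline (the steps here are toy numeric functions we made up, purely to show the shape):

```python
# A hand-rolled "pipeline" is just named functions applied in order,
# each step's output feeding the next step's input
def run_pipeline(steps, data):
    for name, fn in steps:
        data = fn(data)
    return data

steps = [('double', lambda x: x * 2), ('add_one', lambda x: x + 1)]
print(run_pipeline(steps, 10))  # -> 21
```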
Before we formally build our pipeline, let's just take a look at the data a bit and see what we can do with it. Here's a picture of Teru. You can see the dimensions of the picture (1920 x 2560) printed above it.
# Just run this
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
terupicture = Image.open('Part2_Cat_Classifying/cats/1.jpg')
plt.imshow(terupicture)
terupicture.size
Next we can take a look at what's actually stored within the data of this image. If you don't know how pixels work, here is a basic explanation. It's strikingly difficult to find a really clear tutorial of this online, but this is the best we found after a lot of googling. If this video is to be believed, the above cat picture should be represented by a whole lot of pixels (1920 times 2560, in fact) which are each made of three values from zero to 255 (representing Red, Green, and Blue respectively). This should look something like [12, 127, 56], [65, 208, 11], [34, 33, 2], ....
This notebook will just show you a limited subset of these (as printing out ~5 million would be a bit much).
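To make that structure concrete first, here's a tiny synthetic 2x2 image built from made-up pixel values:

```python
import numpy as np
from PIL import Image

# A synthetic 2x2 RGB image: every pixel is an [R, G, B] triple from 0-255
pixels = np.array([[[255, 0, 0], [0, 255, 0]],
                   [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
tiny = Image.fromarray(pixels, mode='RGB')
print(np.array(tiny).shape)  # (2, 2, 3): height, width, color channels
print(np.array(tiny))
```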
teru_image_array = np.array(terupicture)
print(teru_image_array)
Voila, just as we expected.
For the sake of image processing, let's convert this image to grayscale. Again, we'll expect to see values from 0 to 255, but this time it'll be a list of single numbers rather than groups of three.
2.1 (3 points) Convert the image of Teru to grayscale. Show the new image, and then print out the array of numbers like we did above.
HINT: You don't need to (and shouldn't) import anything else for this. There's already a function in something we've imported that will convert images to grayscale.
grayscale_teru = terupicture.convert('L')
plt.imshow(grayscale_teru, cmap=plt.cm.gray)
grayscale_teru_image_array = np.array(grayscale_teru)
print(grayscale_teru_image_array)
Again, this looks as expected. We can do some random other things with PIL like rotating the image or flipping it or cropping it.
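As an aside, PIL's 'L' conversion isn't a plain average of the channels; it uses the ITU-R 601 luma weights, which you can verify on a single made-up pixel:

```python
from PIL import Image
import numpy as np

# PIL's 'L' mode uses the ITU-R 601 luma weights: L = 0.299*R + 0.587*G + 0.114*B
rgb = Image.new('RGB', (1, 1), (100, 150, 200))
gray = np.array(rgb.convert('L'))
print(gray)  # one number per pixel: roughly 0.299*100 + 0.587*150 + 0.114*200
```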
2.2a (1 point) Display a grayscale picture of Teru rotated 20 degrees counterclockwise
2.2b (1 point) Display a grayscale picture of Teru with the bottom 1000 pixels cropped off
HINT: again, no imports needed. The rotated picture will have some black space at the corners now. That's fine.
# Rotate here
grayscaleR_teru = grayscale_teru.rotate(20)
plt.imshow(grayscaleR_teru, cmap=plt.cm.gray)
# Crop here - keep everything except the bottom 1000 pixels
width, height = grayscale_teru.size
cropped_teru = grayscale_teru.crop((0, 0, width, height - 1000))
plt.imshow(cropped_teru, cmap=plt.cm.gray)
Image "Convolutions" are ways to do some math with the pixel values in images to transform them into something that's more useful for classification. This has a nice explanation of what convolutions are, and we'll use some of the ones they talk about here.
2.3 (2 points) Convolve your original gray Teru picture using the "sharpen" kernel they define, then display it.
HINT: See import.
from scipy.signal import convolve2d
kernel_sharp = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])
gray_terupicture_sharpen = convolve2d(grayscale_teru, kernel_sharp, 'same')
plt.imshow(gray_terupicture_sharpen, cmap=plt.cm.gray)
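If you want to sanity-check what convolve2d just did, here's the same sharpen kernel on a tiny made-up patch where you can do the arithmetic by hand:

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny made-up patch: one bright pixel surrounded by darker ones
patch = np.array([[10, 10, 10],
                  [10, 50, 10],
                  [10, 10, 10]])
kernel_sharp = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
out = convolve2d(patch, kernel_sharp, 'same')
print(out[1, 1])  # center: 5*50 - 10 - 10 - 10 - 10 = 210, pushed brighter
```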
This'll look better if we equalize exposure. Don't worry too much about what this does; it basically just increases contrast.
#Just run this. There'll be a UserWarning. That's fine.
from skimage import exposure
gray_terupicture_sharpen_equalized = exposure.equalize_adapthist(gray_terupicture_sharpen/np.max(np.abs(gray_terupicture_sharpen)), clip_limit=0.03)
plt.imshow(gray_terupicture_sharpen_equalized, cmap=plt.cm.gray)
Next let's try the edge detection kernel they specify, also equalizing exposure as shown above.
2.4a (2 points) Convolve your ORIGINAL gray Teru picture using the edge detection kernel they define, then display it.
kernel_edge = np.array([[-1,-1,-1],[-1,8,-1],[-1,-1,-1]])
gray_terupicture_edge = convolve2d(grayscale_teru, kernel_edge, 'same')
gray_terupicture_edge_equalized = exposure.equalize_adapthist(gray_terupicture_edge/np.max(np.abs(gray_terupicture_edge)), clip_limit=0.03)
plt.imshow(gray_terupicture_edge_equalized, cmap=plt.cm.gray)
It kinda sorta found the edges. Looks like it got tricked by fur patterns though.
2.4b (1 point) Which one or two cats might edge processing work best on? Why?
print("Edge processing would work best on cats with fur patterns that are not obvious. For example, this would be great on black fur, like Hubble has in the sample pictures")
Finally, let's try a kernel that isn't on the page.
2.5 (2 points) Go to the kernels wikipedia page and pick one of the three blur kernels and apply it here.
# Box blur: every pixel becomes the average of its 3x3 neighborhood
kernel_blur = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]]) / 9.0
gray_terupicture_blur = convolve2d(grayscale_teru, kernel_blur, 'same')
plt.imshow(gray_terupicture_blur, cmap=plt.cm.gray)
Okay, I think you get the picture (pun intended).
Let's make a cat classifier. We'll follow the steps described above in the section where we talked about pipelines in image classification: import your data, clean it, visualize it, select features, model, and analyze results.
There are four arrays of data we need to import to get our classifier working, which we'll combine to get two final arrays that can be passed into a classifier: the cat images, the scenery images, and a label array for each (1 = cat, 0 = non-cat).
Let's start with the first. The first thing you may notice if you scroll through the images in the "cats" folder is that some of the files are in .jpg format and a handful are .png.
Taro has classier owners than all of the other HCII cats, so some pictures of her are in .png format. For simplicity's sake, we want all of our images in .jpg format, so write a script to convert them to .jpg. Name the output something like "Taro1.jpg", "Taro2.jpg", etc. In order to do this, you're going to want to open each .png file in the folder individually in sequence and convert it to .jpg, saving it under a different name. I'll let you figure out how to do this, but note that the glob package makes it really easy to select all the files in a folder with a certain extension.
Note that you may get an error if you try to convert directly from .png to .jpg because .png images have a property (transparency) that .jpg files don't. We don't need this property here, so you can ignore the error if you get it.
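If you haven't used glob before, here's a quick self-contained sketch (it builds its own temporary folder with made-up filenames, so it touches nothing of yours):

```python
import glob, os, tempfile

# glob matches filenames by wildcard pattern: '*' matches anything
d = tempfile.mkdtemp()
for name in ['a.png', 'b.png', 'c.jpg']:
    open(os.path.join(d, name), 'w').close()
pngs = sorted(os.path.basename(p) for p in glob.glob(os.path.join(d, '*.png')))
print(pngs)  # -> ['a.png', 'b.png']
```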
2.6 (2 points) Convert the .png files in the cats folder to .jpg files and re-save them as new files
import glob
# Open each .png, drop the alpha channel (which .jpg can't store), and re-save as .jpg.
# Just copying the file under a .jpg name wouldn't actually convert the format.
cat_png = glob.glob('Part2_Cat_Classifying/cats/*.png')
for png_path in cat_png:
    img = Image.open(png_path).convert('RGB')
    img.save(png_path[:-4] + '.jpg')
# If you've done this right up until now, the following should output 301
catfilelist = glob.glob('Part2_Cat_Classifying/cats/*.jpg')
print (len(catfilelist))
Finally, you might notice that the images have lots of different sizes and dimensions represented. Let's convert all the images to the same dimensions. This'll make some of them a little oddly-proportioned, but that's ok.
2.7 (3 points) Import each of the .jpg files and resize them to 256x256 pixels. Then, convert these to grayscale, as we did for the cat image above. Re-save these images with names "phdcat0.jpg" through "phdcat300.jpg".
(Sorry for your disk space, though they'll be fairly small files after this).
cat_jpg = glob.glob('Part2_Cat_Classifying/cats/*.jpg')
for i, jpg_path in enumerate(cat_jpg):
    img = Image.open(jpg_path)
    resized_img = img.resize((256, 256))
    gray_img = resized_img.convert('L')
    gray_img.save('Part2_Cat_Classifying/cats/phdcat' + str(i) + '.jpg')
If you did this right, the following should again print out 301:
# Just run this
catfilelist = glob.glob('Part2_Cat_Classifying/cats/phdcat*.jpg')
print (len(catfilelist))
(It's probably useful to remove the old .png files and original (color) .jpg images at this point, just in case you accidentally include them when you didn't mean to. You might want to save a copy of them somewhere in case you mess something up.)
Now it's finally time to import the images into an array you can use for classification.
2.8a (2 points) Put all of the phdcats images in a numpy array called "catimages".
HINT: this will actually be a numpy array of numpy arrays, where the large array contains an array for each of the pictures, and the picture arrays contain grayscale picture values.
catimages = np.empty((301,), dtype=object)
for i in range(301):
    catimages[i] = np.array(Image.open('Part2_Cat_Classifying/cats/phdcat' + str(i) + '.jpg'))
2.8b (1 point) What does the code below do? What do each of the numbers mean?
The cell below prints the shape of the outer catimages array, then the shape of the image stored at index 0. The first line shows that this is an array of 301 cat images; the second shows that each image is (256, 256) pixels, since that is what we resized everything to.
print(catimages.shape)
print(catimages[0].shape)
Let's take a look at what's in this new numpy array to make sure it makes sense:
# Just run this.
# This makes a nice 4x4 display to show images numbers 0-15 in the array,
# and the underlying numerical data for four of them
fig = plt.figure(figsize=(16, 16))
for i in range(1, 17):
    img = catimages[i-1]
    fig.add_subplot(4, 4, i)
    plt.imshow(img, cmap=plt.cm.gray)
print (catimages[:4])
Status checkpoint:
You can do the exact same thing to clean and import the scenery pictures that you did for the cat pictures. Fortunately, they're all already .jpg files.
2.9 (3 points) Convert the scenery images to grayscale and 256x256, save them as "scenery0.jpg" through "scenery332.jpg" and then put them in a numpy array called "sceneryimages". Do this all in one cell below.
(Note that we're using the word "scenery" loosely here. They're just a bunch of images without animals or people in them)
import glob
scenery_jpg = glob.glob('Part2_Cat_Classifying/scenery/*.jpg')
for i, jpg_path in enumerate(scenery_jpg):
    img = Image.open(jpg_path)
    resized_img = img.resize((256, 256))
    gray_img = resized_img.convert('L')
    gray_img.save('Part2_Cat_Classifying/scenery/scenery' + str(i) + '.jpg')
sceneryimages = np.empty((333,), dtype=object)
for i in range(333):
    sceneryimages[i] = np.array(Image.open('Part2_Cat_Classifying/scenery/scenery' + str(i) + '.jpg'))
# If you've done this correctly, this should print 333
print (len(sceneryimages))
You'll probably want to remove the original scenery .jpg images here too, either deleting them or saving them somewhere else.
Run the following to double check your array:
# This makes the same display for scenery images that we did above for cat images, just to take a look at what's there.
fig = plt.figure(figsize=(16, 16))
for i in range(1, 17):
    img = sceneryimages[i-1]
    fig.add_subplot(4, 4, i)
    plt.imshow(img, cmap=plt.cm.gray)
print (sceneryimages[:4])
Status checkpoint:
Now we have all of our images of cats and non-cats cleaned and imported into numpy arrays. This will serve as the source for our Train_X, Validation_X, and Test_X sets once we smush the two arrays together. However, in order to train a classifier we need to give the classifier an array telling it which images are of cats and which are of non-cats (where 1 = cat, and 0 = non-cat). Since we've kept our two types of images separate, it's pretty easy to make an array of ones and an array of zeros and then smush those two arrays together.
Since we know the length of our catimages array and our sceneryimages array:
2.10a (1 point) Make a one-dimensional array full of ones as long as the catimages array
2.10b (1 point) Make a one-dimensional array full of zeros as long as the sceneryimages array
catimages_y = np.ones((len(catimages),), dtype=int)
sceneryimages_y = np.zeros((len(sceneryimages),), dtype=int)
# Should print 301 and 333:
print(len(catimages_y), len(sceneryimages_y))
Status checkpoint:
Now we need to smush the two images arrays together to get a final X array. One might call this process... conCATenating. ba dum tss
2.11 (2 points) Create a new array that's one-dimensional and contains all the cat images followed by all of the scenery images
All_X = np.append(catimages,sceneryimages)
# Should print (634,) - an object array with one 256x256 image per slot
print(All_X.shape)
Now smush the two y arrays together. Make sure you combine them in the correct order - if you put catimages first above, make sure to put catimages_y first here:
All_y = np.append(catimages_y,sceneryimages_y)
# Should print (634,)
print(All_y.shape)
The following will show the first three, middle three, and last three images in your images array, along with the corresponding classifications. This is your last chance to make sure you've combined things properly, so if you have scenery images labeled as "1" or cat images labeled as "0", go back and figure out where things went wrong now.
# Just run this
fig = plt.figure(figsize=(16, 16))
for i in range(1, 4):
    img = All_X[i-1]
    title = All_y[i-1]
    sp = plt.subplot(4, 4, i)
    sp.set_title(title)
    plt.imshow(img, cmap=plt.cm.gray)
fig = plt.figure(figsize=(16, 16))
for i in range(318, 321):
    img = All_X[i-1]
    title = All_y[i-1]
    sp = plt.subplot(4, 4, i-317)
    sp.set_title(title)
    plt.imshow(img, cmap=plt.cm.gray)
fig = plt.figure(figsize=(16, 16))
for i in range(632, 635):
    img = All_X[i-1]
    title = All_y[i-1]
    sp = plt.subplot(4, 4, i-631)
    sp.set_title(title)
    plt.imshow(img, cmap=plt.cm.gray)
One last step before we split these into train and test- we need to flatten them to two-dimensional arrays because that's what our classifiers will accept. This is pretty simple:
# Just run this
All_Flat_X = np.array([image.flatten() for image in All_X])
print(All_Flat_X.shape)
Now, as we did above, we need to split these final arrays into train, validation, and test sets randomly. Do this below. Let's do 60% of the data into training, 20% into validation, and 20% into testing.
There are a ton of ways to do this, but we're giving you a hint that should lead you toward a very simple way. Note that you NEED to take a random subset here. If you just put the first 60% into train, and the next 40% split between validation and test, your validation and test sets will only have scenery images in them because your data isn't in a random order right now.
2.12 (5 points) Split the data into X and y for Train, Validation, and Test sets
from sklearn.model_selection import train_test_split
# This only splits a numpy array into two random sub-arrays,
# but if you're clever you can also use it to get a validation set with just one more line of code
Train_X, Other_X, Train_y, Other_y = train_test_split(All_Flat_X, All_y, test_size=0.40)
Validation_X, Test_X, Validation_y, Test_y = train_test_split(Other_X, Other_y, test_size=0.50)
# Print out the following
print (len(Train_X), len(Train_y), len(Test_X), len(Test_y), len(Validation_X), len(Validation_y))
print (Train_y)
print (Validation_y)
print (Test_y)
print (Train_X[0])
Alright, let's test out some classifiers. We'll use the same ones as we did above for the text classification task.
2.13a (8 points) Train models based on each of the four classifiers listed here plus one more you pick. Write down average accuracy and F1 scores again as you did above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
classifier = RandomForestClassifier()
classifier.fit(Train_X,Train_y)
scores = cross_val_score(classifier, Validation_X, Validation_y, cv = 5)
scores2 = cross_val_score(classifier, Validation_X, Validation_y, cv = 5, scoring = 'f1_weighted')
#Write down the stats the same way as you did above, 3 digits is fine
print("KNeighborsClassifier_AvgAccuracy = 0.5948")
print("KNeighborsClassifier_AvgF1 = 0.5926")
print("MultinomialNB_AvgAccuracy = 0.5686")
print("MultinomialNB_AvgF1 = 0.5635")
print("LogisticRegression_AvgAccuracy = 0.5729")
print("LogisticRegression_AvgF1 = 0.5722")
print("RandomForestClassifier_AvgAccuracy = 0.6095")
print("RandomForestClassifier_AvgF1 = 0.5673")
print("YourPick(DecisionTree)_AvgAccuracy = 0.4996")
print("YourPick(DecisionTree)_AvgF1 = 0.5515")
2.13b (8 points) Pick the best classifier you found and use it on the train and test data below. Print out your accuracy, weighted f1 score, and confusion matrix.
There might not be a clear winner, so don't worry too much about which to pick
from sklearn.metrics import accuracy_score, f1_score
classifierBest = RandomForestClassifier()
# Fit on the training data, then predict on the held-out test data
classifierBest.fit(Train_X, Train_y)
Test_y_pred = classifierBest.predict(Test_X)
print("BestClassifier_Accuracy = " + str(accuracy_score(Test_y, Test_y_pred)))
print("BestClassifier_F1 = " + str(f1_score(Test_y, Test_y_pred, average='weighted')))
confusion_matrix(Test_y, Test_y_pred)
2.13c (2 points) What would your accuracy be if you had just picked the majority class from the training set every time? Be careful you know what the majority class is!
# Overall, the majority class is scenery (0): 333 of the 634 images.
# Double-check which class actually dominates the random training split,
# then score that constant guess against the test labels.
majority = 0 if np.sum(Train_y == 0) >= np.sum(Train_y == 1) else 1
accuracy = np.mean(Test_y == majority)
print (accuracy)
Okay. If your results are anything like ours, your classifier is very slightly better than random at guessing which images have cats in them.
As it turns out, image classification is a really hard problem, and in the process of cleaning this data we've removed a whole lot of information. We took away the color, made the images smaller and changed them away from their natural aspect ratios, and we flattened them out.
We tried very hard in writing this assignment to scaffold creation of a good image classifier from the ground up, but we couldn't find a way to do it that didn't involve a lot of hand-waving and saying "trust us". So we had you make a pretty bad image classifier from the ground up.
There are (for better or worse) lots of premade image classification algorithms that do much better on most datasets. In fact, if you google/youtube around for image classification you'll find almost nobody recommending an approach that involves building a classifier from scratch. Most everybody just says to use TensorFlow. The following section is entirely optional (but don't forget about the third section that follows it, which is NOT optional!). You'll get some bonus points if you do this optional section, and you can see what a really high-accuracy image classifier looks like (kinda). Hopefully you'll also get some idea of what TensorFlow is, as it's a very widely used package in industry right now.
TensorFlow is an open source library from Google that does a variety of things across data science, including complex classification. It's hard to find an answer to what exactly TensorFlow does that's comprehensible to a layperson. The closest approximation is that it provides long premade pipelines for different applications, full of convolutions and other math that transform the data in ways that make subsequent classification work better.
We're going to quickly run through a TensorFlow image classification model using tf.keras.
This code and the code in the "BONUS_tensorflow-for-poets-2" folder is pulled (slightly modified) from TensorFlow's tutorial here
# Use a terminal/console to navigate to the BONUS_tensorflow-for-poets-2 folder that you downloaded.
# Paste the following into terminal once you're there. (Don't paste the triple quotes or semicolon)
"""
python -m scripts.retrain \
--bottleneck_dir=tf_files/bottlenecks \
--how_many_training_steps=500 \
--model_dir=tf_files/models/ \
--summaries_dir=tf_files/training_summaries/"mobilenet_0.50_224" \
--output_graph=tf_files/retrained_graph.pb \
--output_labels=tf_files/retrained_labels.txt \
--architecture="mobilenet_0.50_224" \
--image_dir=tf_files/catsvsnotcats/
"""
;
BONUS.1 (0.5 point each, 4 points total): Explain what the inputs to retrain.py above do in no more than 15 words for each.
print (len(("Input 1 gives a directory for the bottleneck, which encourages compression of features").split()) < 16)
print (len(("Input 2 gives the number of steps that the model will train on").split()) < 16)
print (len(("Input 3 gives the directory where we store the model of the pipelines").split()) < 16)
print (len(("Input 4 gives the directory of where to store summary data once it is generated").split()) < 16)
print (len(("Input 5 gives the directory where we will store the graph of data collected").split()) < 16)
print (len(("Input 6 gives the directory where we wlll store the labels of the data collected").split()) < 16)
print (len(("Input 7 gives the family of neural networks we will use to train a dataset").split()) < 16)
print (len(("Input 8 gives the directory of the images we want to train on").split()) < 16)
While the output of the script is whizzing by, you'll see lots of accuracy numbers that are probably very high. MUCH higher than the previous model. Your final accuracy should probably be higher than 95%, if not 100% on the test data.
Let's see how it does on some individual images we held out:
# While in the same folder in the terminal, paste the following script.
# It'll make a prediction for one of the files we held out of the training
"""
python -m scripts.label_image \
--graph=tf_files/retrained_graph.pb \
--image=tf_files/heldout/phdcat286.jpg
"""
;
You might get an output that looks something like this:
"cats (score=0.98744) scenery (score=0.01256)"
That basically means that it's really sure that it's a cat, which is correct in this case (which is kind of impressive given how little of the cat is shown in this particular image). You can try it on any of the other images in the heldout folder, and it'll give you similar predictions.
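The two scores sum to 1 because the network's final layer is a softmax (at least in this tutorial's setup), and the predicted label is just the class with the higher score. A stdlib-only sketch of that last step — the logits here are made-up numbers, not real network outputs:

```python
import math

def softmax(logits):
    # Exponentiate and normalize so the scores sum to 1
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cats", "scenery"]
scores = softmax([4.2, -0.2])  # made-up final-layer logits

prediction = labels[scores.index(max(scores))]
print(prediction, scores)  # "cats", with a cats score near 0.99
```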
This model is going to be REALLY good at identifying cats vs not cat images from this dataset. How exactly does it work, you ask?
¯\_(ツ)_/¯ No idea. But it works!
Let's try to see if we can fool TensorFlow.
BONUS.2 (1 point or 3 points): Find an image of a cat somewhere online that wasn't included in this dataset. It must have only one cat, and no other animals or humans. All of the cat must be visible such that a human could easily tell that there's a cat in the picture. 1 point: get it so TensorFlow is less than 80% sure that there's a cat in the picture, based on the model you train on the phdcats data. 2 additional points: get it so TensorFlow guesses incorrectly that you have a picture of scenery. Display the image you found below along with the cats and scenery scores you found.
You might have to try a few pictures to get this to work. No cheating by blurring or scribbling over or otherwise screwing with your image is allowed!
This is the final part of the assignment!
Labeled Faces in the Wild (LFW) is a well-known repository of human faces originally based on people who were in the news about a decade ago. The most common face in the database is former US president George W. Bush, who was president at the time.
Through subsequent work, the original authors of the database have attached values for 73 attributes to each image, from "Pale Skin" to "Senior" to "Eyeglasses" to "Male", where a higher value indicates that the image is more representative of this label. The basis of these attributes comes from MTurk worker ratings, but a variety of fancy math led to the values they have now. You can see the full file with attributes for all of the 13,000+ images in the dataset here.
For this activity, we're going to be working off a dataset made entirely of images of George W. Bush. You can find these images in the appropriately-named "George_W_Bush" folder, along with a file called "bushattributes.csv" that shows the attributes of each of the images. If you look in this csv file, you can see that images of Bush are generally labeled with values > 0 in the dataset for Male, White, and not wearing lipstick, and < 0 for Black, Baby, and Heavy Makeup. Seems reasonable, as Bush is White and Male and doesn't usually wear lipstick. For whatever it's worth, the average "Attractive Man" score for Bush is slightly higher than the overall average in the Labeled Faces in the Wild dataset.
As there is variance in the attributes among all of these images (e.g., Bush has a higher "White" score in some images), we can train a classifier on these images and use it to predict attributes of other unseen images. A reasonable-ish exercise (though the results would be low-quality) would be to use this to predict attributes of other images in the dataset. An unreasonable thing to do would be to have it predict attributes of images of you, which, of course, is what we're going to do.
Our importing won't be quite as easy as for the cat pictures because we can't just set all attributes to 1 or 0. We need to make sure we're importing the images and the attributes in the same order. Note that we've already converted the images to grayscale for you, though we've kept them at their original dimensions of 250x250 (in the original dataset they're 250x250 and in color).
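One low-effort safeguard for keeping images aligned with attribute rows: `glob.glob()` makes no ordering promise, so sort the filenames before pairing them with csv rows. A sketch with made-up filenames (the real ones follow the same zero-padded pattern):

```python
# glob.glob() makes no ordering promise, so sort filenames before
# pairing them with csv rows. Made-up filenames for illustration:
unsorted_files = ["George_W_Bush_0003.jpg",
                  "George_W_Bush_0001.jpg",
                  "George_W_Bush_0002.jpg"]
files = sorted(unsorted_files)

# Zero-padded numbering means lexicographic order == numeric order
print(files[0], files[-1])  # George_W_Bush_0001.jpg George_W_Bush_0003.jpg
```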
3.1 (1 point): Import the images of George W. Bush into a numpy array called bushimages.
bushimages = np.empty((524,), dtype=object)
# Sort so the image order matches the row order in bushattributes.csv
bush_jpg = sorted(glob.glob('./Part3_George_W_Bush/George_W_Bush/*.jpg'))
for i, img in enumerate(bush_jpg):
    bushimages[i] = np.array(Image.open(img))
#Should return 524
print(len(bushimages))
As we did a bunch of times above, let's double check what we have in the array.
3.2 (1 point): Plot the first 16 images in your array in a 4x4 grid, and output the numerical contents of the first four of them, just as we did above.
fig = plt.figure(figsize=(16, 16))
for i in range(1, 17):
    img = bushimages[i-1]
    fig.add_subplot(4, 4, i)
    plt.imshow(img, cmap=plt.cm.gray)
print (bushimages[:4])
3.3 (1 point): Flatten these images, as we did in the cats/scenery part above.
All_Bush_Flat_X = np.array([image.flatten() for image in bushimages])
# Should print (524, 62500)
print(All_Bush_Flat_X.shape)
Next we need to come up with what we're going to predict. We can do this by importing columns from the bushattributes csv file. If you need some hints, you can look back at A1 where you imported stuff from csv files.
3.4 (1 point): Import all of the attributes from the bushattributes.csv file. Attribute names should be column headers, and each row should correspond to the photo its attributes are tied to.
import pandas
trainingdata = pandas.read_csv("bushattributes.csv", header=0)
# Keep only the numeric attribute columns
trainingdata = trainingdata.select_dtypes(include='number')
trainingdata
The things we want to predict need to be one-dimensional arrays with length equal to the number of photos (524, in this case).
3.5 (1 point): Create separate numpy output arrays for the "Male", "White", "Black", "Asian", "Eyeglasses", and "Mouth Closed" columns. Then, pick any three other attributes you'd like to predict out of the 73 and create output arrays for these too.
male_array = np.array(trainingdata['Male'])
white_array = np.array(trainingdata['White'])
black_array = np.array(trainingdata['Black'])
asian_array = np.array(trainingdata['Asian'])
eyeglasses_array = np.array(trainingdata['Eyeglasses'])
mouthclosed_array = np.array(trainingdata['Mouth Closed'])
youth_array = np.array(trainingdata['Youth'])
senior_array = np.array(trainingdata['Senior'])
paleskin_array = np.array(trainingdata['Pale Skin'])
We took care of the validation and testing setup for you this time. Results: all the models are pretty bad. But that's okay, we've got enough to pitch to get VC dollars. (That was a joke. Mostly.)
Go ahead and use ExtraTreesRegressor as a classifier. It'll take a while to run, but it's among the least terrible.
3.6 (4 points): Put five images of yourself in the "You" folder and convert them to grayscale and 250x250 (you can use whatever images you like, you won't need to submit them to us so you'll be the only one to see them). Then import them and flatten them and use them as a set to make predictions on using this classifier. Run the classifier nine times - once to predict each of the six attributes we specified above for all five of your photos, and once to predict each of the three attributes that you chose for all five of your photos.
Save the numbers you get each time somewhere, but please don't copy/paste the code into nine successive cells - just keep it in one cell and edit the predicted array each time. It'll make this much easier for us to read.
Grab a drink or find something good on Netflix - these models take a while to run. On the three-year-old PowerBook that this is being written on, it's taking about two minutes per model.
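If you'd rather not hand-edit the target array nine times, the runs can also be organized (still in one cell) as a loop over a dict of attribute name to target array. A structural sketch with a stand-in predictor — `fake_fit_predict` is hypothetical and just returns the training-target mean, where the real version would fit ExtraTreesRegressor on the Bush data and predict on your five photos:

```python
# Stand-in for "fit ExtraTreesRegressor on the Bush data, then predict
# on my five photos": here it just predicts the training-target mean.
def fake_fit_predict(targets, n_photos=5):
    mean = sum(targets) / len(targets)
    return [mean] * n_photos

# Made-up target arrays keyed by attribute name
targets_by_attr = {
    "Male": [1.0, 2.0, 3.0],
    "White": [0.0, 1.0],
}

averages = {}
for attr, targets in targets_by_attr.items():
    preds = fake_fit_predict(targets)
    averages[attr] = sum(preds) / len(preds)
print(averages)  # {'Male': 2.0, 'White': 0.5}
```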
# Import my five grayscale 250x250 photos, flatten them, and predict
from sklearn.ensemble import ExtraTreesRegressor
tom_jpg = sorted(glob.glob('./Part3_George_W_Bush/You/*.jpg'))
tomimages = np.empty((5,), dtype=object)
for i, img in enumerate(tom_jpg):
    tomimages[i] = np.array(Image.open(img))
All_Tom_Flat_X = np.array([image.flatten() for image in tomimages])
classifier = ExtraTreesRegressor()
# Swap in a different target array (male_array, white_array, ...) for
# each of the nine runs, and save the predictions each time
classifier.fit(All_Bush_Flat_X, paleskin_array)
predicted = classifier.predict(All_Tom_Flat_X)
print("Thomas's Average Male-ness: 1.7112")
print("Thomas's Average White-ness: 0.762")
print("Thomas's Average Black-ness: -2.054")
print("Thomas's Average Asian-ness: -0.979")
print("Thomas's Average Eyeglasses-ness: -0.9605")
print("Thomas's Average MouthClosed-ness: 0.4346")
print("Thomas's Average Youth-ness: -0.4918")
print("Thomas's Average Senior-ness: -0.4380")
print("Thomas's Average Paleskin-ness: 0.9722")
Okay. With that all done:
3.7 (0.33 points each): On a scale of 1-5 for each, how well did the classifier classify your images for each of the above attributes?
#Fill in below:
print("Male: 5 | I am pretty male")
print("White: 4 | I am only a small degree of white")
print("Black: 5 | I am not at all black")
print("Asian: 1 | I am incredibly asian")
print("Eyeglasses: 1 | I always wear eyeglasses")
print("Mouth closed: 3 | My mouth does open")
print("Youth: 2 | I am only 19 ")
print("Senior: 4 | I am only 19 ")
print("Pale Skin: 5 | I am relatively pale, but not that pale ")
A model trained exclusively on pictures of George W. Bush is clearly going to be biased, as the training data is not at all representative of... anybody other than George W. Bush, really.
As noted above, the broader dataset here comes from photos of people in the news in the mid-2000s, and George W. Bush is the most-represented person in this dataset.
3.8 (3 points) Where (presumably other than news articles from the mid-2000s) would you scrape faces from to get a dataset that you think would be quite diverse but also that you'd be confident faces like yours would be represented in? Why did you pick this dataset? (no more than 50 words).
print (len(("I would pick the profile pictures of people in the CMU Facebook groups. In general, I believe CMU is a very diverse school. There are definately many different races and cultures, including my own and all of our faces are on Facebook.").split()) < 51)
3.9 (2 points each, 10 total): Assuming this classifier achieved significantly better scores for model quality than it currently does, which of the following would you be comfortable using it for? Why or why not? (30 words or less for each).
# Picking which photos of you are most attractive to use on Tinder/Grindr/Bumble/Coffee Meets Bagel/OkCupid...
print (len(("I believe this would be a fine use of the classifier. Through this, we can objectively tell whether our pictures are generally good or not").split()) < 31)
# Automating your Tinder swiping by picking people who are above a certain attractiveness threshold.
# (This has actually been done, though not with the George W Bush dataset)
print (len(("I would not be comfortable doing this, because tastes are subjective, and our tastes do not always align with the general public.").split()) < 31)
# Automatically detecting race in photos of college applicants to verify what they enter.
print (len(("I would not be comfortable using it for this, as minor errors can cause a lot of disruption and such data can be used for discrimination").split()) < 31)
# Detecting people who are wearing glasses in their Facebook photos in order to target glasses ads to them
print (len(("I would be comfortable using this. The risks are not high and there is a small chance that this will cause unrest or etihical issues.").split()) < 31)
# Looking at professional photos of speakers at an event to count how many speakers are female
print (len(("I would be comfortable doing this. This could be useful in locating certain talks of interest to a niche audience.").split()) < 31)
3.10 (BONUS, up to 5 points): Go download the full Labeled Faces in the Wild attributes csv, the one with 13,000+ rows. You DO NOT have to download all the photos. For all nine of the attributes you've been using above, find the average value across the whole dataset. Next, go skim the paper where they talk about how the attribute values were generated. What do each of these attributes mean? (no word limit, but be kind to us)
print("Average Male: -1.083")
print("Average White: -1.659")
print("Average Black: -1.733")
print("Average Asian: 0.541")
print("Average Eyeglasses: -1.263")
print("Average MouthClosed: -0.254")
print("Average Youth: -0.390")
print("Average Senior: -0.863")
print("Average Pale Skin: -0.2439")
print("In order to obtain the training data, they used Mechanical Turk. The attributes themselves were generated in order to be representative of a visual appearnce. They then had binary inputs(1/0) from Turkers and averaged them out to get the data for a specific person.")
3.11 (BONUS, up to 3 points): Find the total range of values (max-min) for each of the nine attributes in the full csv. Should be around 5 to 10. What proportion of this total range for each attribute is covered by the range in the George Bush attributes?
print("TOTAL RANGES")
print("Total Male Range: 5.965")
print("Total White Range: 7.687")
print("Total Black Range: 4.497")
print("Total Asian Range: 6.141")
print("Total Eyeglasses Range: 4.646")
print("Total MouthClosed Range: 4.677")
print("Total Youth Range: 2.805")
print("Total Senior Range: 5.749")
print("Total Pale Skin Range: 7.384")
print("")
print("GWB RANGES")
print("GWB Male Range: 3.227")
print("GWB White Range: 3.417")
print("GWB Black Range: 3.527 ")
print("GWB Asian Range: 2.643 ")
print("GWB Eyeglasses Range: 3.279")
print("GWB MouthClosed Range: 4.329")
print("GWB Youth Range: 2.414")
print("GWB Senior Range: 2.220")
print("GWB Pale Skin Range: 5.223")
print("")
print("GWB PROPORTIONAL COVERAGE")
print("GWB Male Proportion: " + str(3.227/5.965))
print("GWB White Proportion: "+ str(3.417/7.687))
print("GWB Black Proportion: "+ str(3.527/4.497))
print("GWB Asian Proportion: "+ str(2.643/6.141))
print("GWB Eyeglasses Proportion: "+ str(3.279/4.646))
print("GWB MouthClosed Proportion: "+ str(4.329/4.677))
print("GWB Youth Proportion: "+ str(2.414/2.805))
print("GWB Senior Proportion: "+ str(2.220/5.749))
print("GWB Pale Skin Proportion: "+ str(5.223/7.384))
Once you've completed all of the above, you're done with assignment 3! You might want to double check that your code works like you expect. You can do this by choosing "Restart & Run All" in the Kernel menu. If it outputs errors, you may want to go back and check what you've done.
Once you think everything is set, please run ALL of your code, download your final notebook as HTML, and submit to the A3 folder on the Canvas site with name [yourandrewid]_haiif18a[assignmentnumber], e.g., jseering_haiif18a3.